feat(spark): Support data skipping based on partitioned RLI by cshuo · Pull Request #19013 · apache/hudi

cshuo · 2026-06-16T01:38:33Z

Describe the issue this Pull Request addresses

Spark data skipping based on record level index previously only supported global RLI lookup semantics. When the record level index is partitioned, Spark could not use the partitioned RLI metadata layout to narrow candidate files from record-key predicates.

Summary and Changelog

Add Spark data-skipping support for partitioned record level index.
Split global and partitioned RLI pruning into dedicated RecordLevelIndexSupport implementations.
Route RLI pruning based on the metadata record-index definition, using HoodieRecordIndex.isPartitioned.
Add partitioned RLI lookup by grouping (partition, recordKey) pairs against the corresponding metadata file groups.
Extract PartitionedRecordIndexFileGroupLookupFunction so it can be reused by Spark metadata index lookup and Spark datasource pruning.
Add functional SQL coverage for partitioned RLI pruning, including the max-partitions threshold behavior.
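
The grouping step in the changelog can be pictured with a small, self-contained sketch. It is not the code from this PR: the class name, the `partition#index` bucket label, and the `fileGroupIndex` hash are all hypothetical stand-ins (Hudi's real key-to-file-group mapping is different), and the sketch assumes every candidate partition has a known file-group count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionedKeyGrouping {

  /** Hypothetical shard function; stands in for Hudi's real key-to-file-group mapping. */
  static int fileGroupIndex(String recordKey, int fileGroupCount) {
    return Math.floorMod(recordKey.hashCode(), fileGroupCount);
  }

  /**
   * Pairs every candidate partition with every record key and groups the keys
   * under a "partition#fileGroupIndex" bucket, so that each bucket can be
   * looked up against exactly one metadata file group.
   * Assumes fileGroupCounts has an entry for every candidate partition.
   */
  static Map<String, List<String>> group(List<String> partitions,
                                         List<String> recordKeys,
                                         Map<String, Integer> fileGroupCounts) {
    Map<String, List<String>> grouped = new TreeMap<>();
    for (String partition : partitions) {
      int count = fileGroupCounts.get(partition);
      for (String key : recordKeys) {
        String bucket = partition + "#" + fileGroupIndex(key, count);
        grouped.computeIfAbsent(bucket, b -> new ArrayList<>()).add(key);
      }
    }
    return grouped;
  }

  public static void main(String[] args) {
    System.out.println(group(List.of("p1", "p2"), List.of("a", "b"), Map.of("p1", 2, "p2", 1)));
  }
}
```

Each resulting bucket holds only keys that map to the same file group of the same data partition, which is the shape a per-file-group lookup function would need as input.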

Impact

This improves Spark query pruning when partitioned RLI is enabled and a query contains record-key filters, optionally combined with partition filters.

For partitioned RLI, pruning is skipped when too many candidate partitions remain, avoiding expensive metadata fan-out. Global RLI behavior remains covered by the existing test path.
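
As a rough illustration of the threshold behavior described above, the following sketch returns an empty `Optional` when the candidate partition count exceeds a limit, which a caller would treat as "do not prune with RLI". The class and method names are hypothetical and are not taken from the PR.

```java
import java.util.List;
import java.util.Optional;

final class PartitionThresholdGate {

  /**
   * Returns the partitions to look up, or Optional.empty() when the candidate
   * count exceeds maxPartitions. An empty result means "skip RLI pruning and
   * keep all candidate files", not "no files match".
   */
  static Optional<List<String>> partitionsToLookup(List<String> candidatePartitions,
                                                   int maxPartitions) {
    if (candidatePartitions.size() > maxPartitions) {
      return Optional.empty();
    }
    return Optional.of(candidatePartitions);
  }

  public static void main(String[] args) {
    System.out.println(partitionsToLookup(List.of("p1", "p2", "p3"), 2).isPresent());
  }
}
```

The important design point is that exceeding the threshold must widen the result (no pruning) rather than narrow it, so correctness never depends on the limit.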

Risk Level

Medium.

The change touches Spark datasource file-index pruning and metadata-table lookup paths, but the implementation is scoped to record-level-index pruning and includes dedicated functional coverage. The targeted TestRecordLevelIndexWithSQL#testPartitionedRliPartitionsThreshold test passed locally.

Documentation Update

No documentation update is required. This change enables Spark pruning behavior for an existing metadata index mode and does not add a new public user-facing configuration.

Contributor's checklist

Read through contributor's guide
Change has been documented
Change has been tested
Change has been verified

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds partitioned RLI support to Spark data skipping, refactoring RecordLevelIndexSupport into a base class with global / partitioned implementations and gating the per-partition lookups behind a new threshold config. The Scala refactor and test inheritance via isPartitionedRli are nicely done, but I have one significant concern about how the downstream lookupIndexRecords handles multi-key + multi-file-group lookups — please take a look at the inline comment. Once that is clarified, this should be ready for a Hudi committer or PMC member to take it from here for a final review. A couple of small naming and magic-string nits below.

hudi-agent · 2026-06-16T03:41:52Z

+      val keys = HoodieListData.eager(JavaConverters.seqAsJavaListConverter(recordKeys).asJava)
+      val fileIdToPartitionMap: mutable.Map[String, String] = mutable.Map.empty
+      for (partition <- partitions) {
+        val recordIndexData = metadataTable.readRecordIndexLocationsWithKeys(keys, HOption.of(partition))


🤖 I'd like to double-check correctness when a data partition has more than one file group in the partitioned RLI. Looking at HoodieBackedTableMetadata.lookupIndexRecords for the dataTablePartition.isPresent() branch, it computes a single fileGroupIndex from only the first key (distinctSortedKeys.stream().findFirst().get()) and reads from fileSlicesForDataPartition.get(fileGroupIndex), i.e. only one shard. Its comment says "all keys will be from the same shard index", which is true for the write path (SparkMetadataTableRecordLevelIndex.lookupRecords pre-partitions by file-group index before calling), but here we forward all recordKeys to that API per partition.

If an IN (...) query has multiple keys that hash to different file groups inside a partition (which can happen once estimateFileGroupCount allocates more than 1 file group for a partition, bounded by record.level.index.max.filegroup.count = 10 by default), the lookup would only see the first key's shard, miss the others, and over-prune the candidate files, producing incorrect query results. The current tests use small data (≤5 records/partition), so each partition has just 1 file group and this path is not exercised.

Could you confirm this, and either pre-bucket the keys by file group inside the loop (mirroring the write path) or add a test with RECORD_LEVEL_INDEX_MIN_FILE_GROUP_COUNT_PROP set to 2 or more and a multi-key IN filter to verify? @nsivabalan @yihua would appreciate your eyes on the assumption in lookupIndexRecords too.

AI-generated; verify before applying. React 👍/👎 to flag quality.
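
The concern in this comment can be reproduced with a toy model. The sketch below is an assumption-laden simulation, not Hudi code: shards are plain in-memory maps, `shardOf` is a hypothetical hash, and the two methods only imitate "read the first key's shard" versus "route each key to its own shard".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ShardLookupDemo {

  /** Hypothetical shard function; Hudi's real mapping differs. */
  static int shardOf(String key, int shardCount) {
    return Math.floorMod(key.hashCode(), shardCount);
  }

  /** Imitates the suspected behavior: only the shard of the first sorted key is read. */
  static Set<String> lookupFirstKeyShardOnly(List<Map<String, String>> shards, List<String> keys) {
    Set<String> fileIds = new TreeSet<>();
    List<String> sorted = new ArrayList<>(new TreeSet<>(keys));
    if (sorted.isEmpty()) {
      return fileIds;
    }
    Map<String, String> shard = shards.get(shardOf(sorted.get(0), shards.size()));
    for (String key : sorted) {
      String fileId = shard.get(key);
      if (fileId != null) {
        fileIds.add(fileId);
      }
    }
    return fileIds;
  }

  /** Routes every key to its own shard, which is the effect of pre-bucketing by file group. */
  static Set<String> lookupBucketed(List<Map<String, String>> shards, List<String> keys) {
    Set<String> fileIds = new TreeSet<>();
    for (String key : new TreeSet<>(keys)) {
      String fileId = shards.get(shardOf(key, shards.size())).get(key);
      if (fileId != null) {
        fileIds.add(fileId);
      }
    }
    return fileIds;
  }

  public static void main(String[] args) {
    // With this hash, "b" lands in shard 0 and "a" in shard 1.
    List<Map<String, String>> shards = List.of(Map.of("b", "file-2"), Map.of("a", "file-1"));
    System.out.println(lookupFirstKeyShardOnly(shards, List.of("a", "b")));
    System.out.println(lookupBucketed(shards, List.of("a", "b")));
  }
}
```

In this model the first variant returns fewer file IDs than the second whenever the keys span more than one shard, which is the over-pruning scenario the comment asks the author to rule out.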

hudi-agent · 2026-06-16T03:41:52Z

      "skipping over files")

+  val RECORD_INDEX_MAX_DATA_SKIPPING_PARTITIONS: ConfigProperty[Integer] = ConfigProperty
+    .key("hoodie.datasource.read.record.index.max.partitions")


🤖 nit: could you make the key name self-documenting by including the data-skipping context? hoodie.datasource.read.record.index.max.partitions reads like a property of the index itself (its max partition count) rather than a fallback threshold for data skipping. Something like hoodie.datasource.read.record.index.data.skipping.max.partitions would match the field name RECORD_INDEX_MAX_DATA_SKIPPING_PARTITIONS and be unambiguous in a config file without the Javadoc nearby.


hudi-agent · 2026-06-16T03:41:52Z

+    if (isPartitionedRli) {
+      s"${HoodieMetadataConfig.RECORD_LEVEL_INDEX_ENABLE_PROP.key} = 'true'"
+    } else {
+      "hoodie.metadata.record.index.enable = 'true'"


🤖 nit: this branch uses the legacy/alternative key "hoodie.metadata.record.index.enable" as a bare string literal, while the if-branch above uses HoodieMetadataConfig.RECORD_LEVEL_INDEX_ENABLE_PROP.key. It's worth at least a brief comment explaining that this is the backward-compatible alternative key for GLOBAL_RECORD_LEVEL_INDEX_ENABLE_PROP — otherwise it looks like an accidental inconsistency with the key used in rliEnableOpts.


danny0405 · 2026-06-16T03:56:48Z

+    .sinceVersion("1.3.0")
+    .withDocumentation("The maximum number of candidate data table partitions that can be queried through the " +
+      "partitioned record level index for data skipping optimization. Each candidate partition requires a separate " +
+      "metadata table lookup. If the number of candidate partitions exceeds this value, the reader skips record " +


Is this the number per query task?

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this — partitioned RLI support is a useful addition. A couple of correctness/architectural concerns worth double-checking in the inline comments: handling of partitions that aren't yet in the bucketized RLI map, and whether the partition-count threshold should be a config (as the PR description suggests) rather than a hard-coded constant. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor readability nits in the new Scala and Java classes.

hudi-agent · 2026-06-18T09:34:27Z

+                               recordKeys: List[String]): Option[mutable.Map[String, String]] = {
+    val fileGroups = metadataTable.getBucketizedFileGroupsForPartitionedRLI(MetadataPartitionType.RECORD_INDEX)
+    val fileGroupCountPerDataPartition = fileGroups.asScala
+      .filter { case (partition, _) => partitions.contains(partition) }


🤖 Could you double-check what happens if a candidate partition exists in the data table but isn't yet present in getBucketizedFileGroupsForPartitionedRLI (e.g. during an RLI backfill, or any transient skew)? The .filter { partitions.contains(partition) } silently drops such partitions; the lookup then never queries them, and filterCandidateFiles ends up filtering all their files out of the candidate set, so a matching record there would be skipped at read time. The Flink path (hudi-flink/.../RecordLevelIndex.java#groupKeysByPartitionAndShard) explicitly checks state here and refuses to prune; should the Spark path behave the same (fall back to Option.empty if any candidate partition is missing from the indexed map) instead of silently over-pruning? @yihua

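
The fallback suggested in this comment could look roughly like the sketch below. It is written in Java with `java.util.Optional` standing in for the Scala `Option` used in the PR, and the class, method, and parameter names are hypothetical.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class IndexedPartitionCheck {

  /**
   * Returns the file-group counts for the candidate partitions, or
   * Optional.empty() if any candidate partition is missing from the indexed
   * map. An empty result signals "do not prune" instead of silently dropping
   * the unindexed partition.
   */
  static Optional<Map<String, Integer>> fileGroupCountsOrFallback(
      Collection<String> candidatePartitions,
      Map<String, Integer> indexedFileGroupCounts) {
    Map<String, Integer> selected = new HashMap<>();
    for (String partition : candidatePartitions) {
      Integer count = indexedFileGroupCounts.get(partition);
      if (count == null) {
        return Optional.empty();
      }
      selected.put(partition, count);
    }
    return Optional.of(selected);
  }

  public static void main(String[] args) {
    Map<String, Integer> indexed = Map.of("p1", 2);
    System.out.println(fileGroupCountsOrFallback(List.of("p1", "p2"), indexed).isPresent());
  }
}
```

Compared with filtering the indexed map by the candidate set, this version makes a missing partition an explicit reason to give up on pruning for the whole query.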

hudi-agent · 2026-06-18T09:34:27Z

+   * partitions, and fanning out a lookup to each one can add latency that outweighs the skipping benefit. Once the
+   * candidate partition count exceeds this threshold, pruning is skipped.
+   */
+  private[hudi] val MAX_PARTITIONS = 10


🤖 The PR description says: 'Wire HoodieFileIndex to instantiate record-index support through the new factory and read the new hoodie.datasource.read.record.index.max.partitions threshold option.' — but in the current revision this is a hard-coded private[hudi] val MAX_PARTITIONS = 10, and I can't find any read of that option anywhere in the source tree. Was the config dropped intentionally, or did the wiring get lost in a rebase? With Flink's analogous threshold set to 3 and Spark to 10 hard-coded, making this user-tunable seems valuable for tables with many partitions that still want RLI pruning.

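
If the threshold were made user-tunable as this comment proposes, resolving it might look like the sketch below. It uses plain `java.util.Properties` rather than Hudi's `ConfigProperty` API, so the class and method are hypothetical. The key string is the one quoted in the review, and the default of 10 mirrors the hard-coded `MAX_PARTITIONS` value.

```java
import java.util.Properties;

final class MaxPartitionsOption {

  // Key as quoted in the review discussion; whether it survives in the final revision is not confirmed here.
  static final String KEY = "hoodie.datasource.read.record.index.max.partitions";

  // Mirrors the hard-coded MAX_PARTITIONS value mentioned in the comment.
  static final int DEFAULT_MAX_PARTITIONS = 10;

  /** Reads the threshold from the given options, falling back to the default when unset or blank. */
  static int resolve(Properties options) {
    String raw = options.getProperty(KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_MAX_PARTITIONS;
    }
    return Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    Properties options = new Properties();
    options.setProperty(KEY, "25");
    System.out.println(resolve(options));
  }
}
```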

hudi-agent · 2026-06-18T09:34:27Z

+        metadataTable.readRecordIndexLocationsWithKeys(HoodieListData.eager(keysToLookup), Option.of(partitionName));
+    try {
+      Map<String, HoodieRecordGlobalLocation> recordIndexInfo = recordIndexData.collectAsList().stream()
+          .collect(HashMap::new, (map, pair) -> map.put(pair.getKey(), pair.getValue()), HashMap::putAll);


🤖 nit: the three-arg Stream.collect() mutable-reduction form is a bit surprising here — could you use collect(Collectors.toMap(Pair::getKey, Pair::getValue)) instead? It makes the intent immediately obvious.

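
Before swapping the collector, it may be worth noting that the two forms are not strictly equivalent. The sketch below uses `Map.Entry` as a stand-in for Hudi's `Pair` type (an assumption made for self-containment) and compares both forms: the three-argument `collect` overwrites on a repeated key, while `Collectors.toMap` without a merge function throws `IllegalStateException` on a repeated key and `NullPointerException` on a null value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class CollectToMapDemo {

  /** Three-argument mutable reduction, as in the reviewed code: last value wins on a repeated key. */
  static Map<String, String> viaThreeArgCollect(List<Map.Entry<String, String>> pairs) {
    return pairs.stream()
        .collect(HashMap::new, (map, pair) -> map.put(pair.getKey(), pair.getValue()), HashMap::putAll);
  }

  /** Collectors.toMap form suggested in the review: throws on a repeated key or a null value. */
  static Map<String, String> viaToMap(List<Map.Entry<String, String>> pairs) {
    return pairs.stream()
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> pairs =
        List.of(new SimpleEntry<>("key-1", "file-1"), new SimpleEntry<>("key-2", "file-2"));
    System.out.println(viaThreeArgCollect(pairs).equals(viaToMap(pairs)));
  }
}
```

If the lookup can never return the same key twice or a null location, the two forms produce the same map and the `toMap` version reads more directly; otherwise the behavioral difference matters.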

hudi-agent · 2026-06-18T09:34:27Z

+      val fileIdToPartitionMap = partitionedKeyRDD.mapPartitionsToPair(new PartitionedRecordIndexFileGroupLookupFunction(metadataTable))
+        .collect()
+        .asScala
+        .foldLeft(mutable.Map.empty[String, String]) { (fileIdToPartitionMap, location) =>


🤖 nit: using foldLeft to build a mutable Map mixes functional and imperative styles in a way that can trip up readers — have you considered simplifying to .map(t => t._2.getFileId -> t._2.getPartitionPath).toMap? That also lets lookupRecordKeys return Option[Map[String, String]] (immutable) since the map is never mutated after construction.


hudi-bot · 2026-06-18T10:40:40Z

CI report:

c5efb18 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

cshuo marked this pull request as draft June 16, 2026 01:38

cshuo force-pushed the data_skip_partitioned_rli_spark branch from bc06d84 to 1310010 Compare June 16, 2026 01:52

github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label Jun 16, 2026

cshuo marked this pull request as ready for review June 16, 2026 03:26

hudi-agent reviewed Jun 16, 2026


danny0405 reviewed Jun 16, 2026


cshuo marked this pull request as draft June 16, 2026 03:57

cshuo force-pushed the data_skip_partitioned_rli_spark branch 2 times, most recently from ffa0f43 to dd38da2 Compare June 17, 2026 13:38

feat(spark): Support data skipping based on partitioned RLI

c5efb18

cshuo force-pushed the data_skip_partitioned_rli_spark branch from dd38da2 to c5efb18 Compare June 18, 2026 09:06

cshuo marked this pull request as ready for review June 18, 2026 09:07

hudi-agent reviewed Jun 18, 2026


danny0405 approved these changes Jun 19, 2026


cshuo wants to merge 1 commit into apache:master from cshuo:data_skip_partitioned_rli_spark


4 participants
